In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
To build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables provided (or variables that we will derive from the variables provided).
The dataset consists of two data files: training and evaluation. The training dataset contains 17 columns, while the evaluation dataset contains 16. The evaluation dataset is missing column TARGET_WINS. We will start by exploring the training data set since it will be the one used to generate the regression model.
First we see that all data is numeric.
An important aspect of any dataset is to determine how much, if any, data is missing. We look at all the variables to see which if any have missing data. We look at the percentages of the data that are missing:
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count | na_count_perc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2 | 191 | 80.92670 | 12.115013 | 82 | 81.11765 | 13.3434 | 43 | 116 | 73 | -0.1698314 | -0.2952783 | 0.8766116 | 0 | 0.0 |
| TEAM_BATTING_H | 3 | 191 | 1478.62827 | 76.147869 | 1477 | 1477.42484 | 74.1300 | 1308 | 1667 | 359 | 0.1302702 | -0.3710350 | 5.5098664 | 0 | 0.0 |
| TEAM_BATTING_2B | 4 | 191 | 297.19895 | 26.329335 | 296 | 296.62745 | 25.2042 | 201 | 373 | 172 | 0.0915189 | 0.4778716 | 1.9051238 | 0 | 0.0 |
| TEAM_BATTING_3B | 5 | 191 | 30.74346 | 9.043878 | 29 | 30.13072 | 8.8956 | 12 | 61 | 49 | 0.7007420 | 0.7446217 | 0.6543921 | 0 | 0.0 |
| TEAM_BATTING_HR | 6 | 191 | 178.05236 | 32.413243 | 175 | 176.81046 | 35.5824 | 116 | 260 | 144 | 0.2980673 | -0.7172373 | 2.3453399 | 0 | 0.0 |
| TEAM_BATTING_BB | 7 | 191 | 543.31937 | 74.842133 | 535 | 541.31373 | 74.1300 | 365 | 775 | 410 | 0.3115199 | -0.1474175 | 5.4153867 | 0 | 0.0 |
| TEAM_BATTING_SO | 8 | 191 | 1051.02618 | 104.156382 | 1050 | 1046.95425 | 97.8516 | 805 | 1399 | 594 | 0.3985050 | 0.3955105 | 7.5364913 | 102 | 4.5 |
| TEAM_BASERUN_SB | 9 | 191 | 90.90576 | 29.916401 | 87 | 89.06536 | 29.6520 | 31 | 177 | 146 | 0.5553966 | -0.1414909 | 2.1646748 | 131 | 5.8 |
| TEAM_BASERUN_CS | 10 | 191 | 39.94241 | 11.898334 | 38 | 39.49020 | 11.8608 | 12 | 74 | 62 | 0.3468509 | 0.0006392 | 0.8609332 | 772 | 33.9 |
| TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.967123 | 58 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 | 2085 | 91.6 |
| TEAM_PITCHING_H | 12 | 191 | 1479.70157 | 75.788625 | 1480 | 1478.50327 | 72.6474 | 1312 | 1667 | 355 | 0.1279056 | -0.3894781 | 5.4838725 | 0 | 0.0 |
| TEAM_PITCHING_HR | 13 | 191 | 178.17801 | 32.391678 | 175 | 176.93464 | 35.5824 | 116 | 260 | 144 | 0.2989191 | -0.7190905 | 2.3437795 | 0 | 0.0 |
| TEAM_PITCHING_BB | 14 | 191 | 543.71728 | 74.916681 | 537 | 541.74510 | 72.6474 | 367 | 775 | 408 | 0.3144366 | -0.1338563 | 5.4207808 | 0 | 0.0 |
| TEAM_PITCHING_SO | 15 | 191 | 1051.81675 | 104.347208 | 1052 | 1047.80392 | 97.8516 | 805 | 1399 | 594 | 0.3945586 | 0.3903991 | 7.5502990 | 102 | 4.5 |
| TEAM_FIELDING_E | 16 | 191 | 107.05236 | 16.632162 | 106 | 106.58170 | 17.7912 | 65 | 145 | 80 | 0.1780432 | -0.3567367 | 1.2034610 | 0 | 0.0 |
| TEAM_FIELDING_DP | 17 | 191 | 152.33508 | 17.611682 | 152 | 152.04575 | 19.2738 | 113 | 204 | 91 | 0.2164822 | -0.2115741 | 1.2743366 | 286 | 12.6 |
From this result we can see how several variables have a number of missing values. The maximum number of missing values was 2085 in the TEAM_BATTING_HBP variable. This is a significant amount of missing data representing 91.6% of that data.
With missing data asseced, we can look into descriptive statistics in more detail. Interestingly we find that the difference between means and medians is fairly small for all data columns. The maximum difference is in fact only 4.77%. This means that we are to expect the distributions of this data to be fairly uniform. To visualize this we plot histograms for each data.
The plot of distributions does show fairly uniform data, but it also show the potential precense of outliers in at least two of the predictors. This is not the best way to vizualise ouliers. Instead we identify the predictors which seem to have outliers by looking at the scattered and box plots. Two variables with outliers appera to be TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_PITCHING_BB and TEAM_FIELDING_E. We highlight these variables from the desity plots since we can see most of the data concentrated at the lower end of the scales which show tailing off to high values.
TEAM_PITCHING_H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1137 1419 1518 1779 1682 30132
From the summary for this variable we see that the maximum is 3.013210^{4}, which is substantially far from the median of 1518. We can also see a wide spread between mean and median of 261. But we know the maximum is not a realistic number. We can compare the numbers of pitching hits, to the number of batting hits, summary for which is shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 891 1383 1454 1469 1537 2554
For batting the maximum is 2554, much lower than 3.013210^{4}. We expect these two variables to somewhat equal values since one is the reciprocal of the other. As a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 15.
TEAM_PITCHING_SO
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 615.0 813.5 817.7 968.0 19278.0 102
Again looking at the maximum we find it is unreasonably high at 1.927810^{4}, which again is substantially far from the median of 813.5. Again this variable has a reciprocal in TEAM_BATTING_SO. Comparing agaisnt it confirms the outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 548.0 750.0 735.6 930.0 1399.0 102
We see that or batting the maximum is 1399, much lower than 1.927810^{4}. As before as a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 14.4.
TEAM_PITCHING_BB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 476.0 536.5 553.0 611.0 3645.0
Again looking at the maximum we find it is unreasonably high at 3645, which again is substantially far from the median of 536.5. Again this variable has a reciprocal in TEAM_BATTING_SO. Comparing agaisnt it confirms the outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 451.0 512.0 501.6 580.0 878.0
We see that or batting the maximum is 878, much lower than 3645. As before as a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 10.4.
TEAM_FIELDING_E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.0 127.0 159.0 246.5 249.2 1898.0
Again looking at the maximum we find it is unreasonably high at 1898, which again is substantially far from the median of 159.
Looking at correlation of variables to number of wins provides some interesting data. We find some correlations that make sense from what might assume with subject knowledge of base, e.g., the number of hits and number of variables both have significant positive correlation with Wins and other statistics like stolen bases, while still positive, are not so strongly related. What is surprising though, are the pitching statistics. We would assume that a team that allowed the opposing team more hits, would lose more games (and win less), but that is not what the data shows us. Perhaps there are outliers swaying the correlation.
Regardless, we can use some of these correlations to drive initial models later, in terms of likely fields to choose for an effective model.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## [1,] 1 0.3887675 0.2891036 0.1426084
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## [1,] 0.1761532 0.2325599 NA NA
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## [1,] NA NA -0.1099371 0.1890137
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## [1,] 0.1241745 NA -0.1764848 NA
Are any of the variables missing and need to be imputed “fixed”?
First task under data preparation will be to eliminate all missing data. In the Data Exploration section we found one variable, TEAM_BATTING_HBP with an exceptionaly high percentage of missing data, so we commence by eliminating this variable. We also remove the “INDEX” column as that is not used.
Next task is to handle missing data in the other variables. Here, becouse the percentages of missing data are lower, we can replace missing data with the median. We prefer replacing with median instead of mean because the latter is more sensitive to outliers. So we get a clean dataset without missing values.
Note we also consider zeros to be missing data. Since each row is a season of data for a given baseball team, it would be extraordinarily unlikely that any of these statistics would have zero as an actual value. Therefore we are assuming zero is another indicator of missing value and we will transform them into a median value.
In the exploratory phase we also identified several variables with outliers. Outliers will be substituted with median. Again we choose median becouse it is less influenced by these outliers. What cutoff to use to tag an outlier reading could be a 3 standard deviation from the mean, or 1.5 time the inter quartile range, but in this case becouse these variables have reciprocals as seen in the exploratory phase, we will use the maximum reading of those variables.
TEAM_PITCHING_H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1137 1419 1518 1562 1636 2544
From the summary we now see that the maximum is 2544, which is a much more reasonable number. We can also see a wide spread between mean and median of 44, indicating a more normal distribution than before.
TEAM_PITCHING_SO
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 181.0 633.0 816.0 796.8 948.2 1399.0
From the summary we now see that the maximum is 1399, which is a much more reasonable number. We can also see a wide spread between mean and median of -19, indicating a more normal distribution than before.
Lastly, before we create models, lets divide data into test and training sets, with 80% for training, 20% for test. This way we have a method to validate our models.
Batting only model
Combine all batting variables.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.157 -8.688 0.679 9.599 45.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.472589 3.369651 5.185 2.4e-07 ***
## TEAM_BATTING_H 0.043178 0.002281 18.927 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 1818 degrees of freedom
## Multiple R-squared: 0.1646, Adjusted R-squared: 0.1642
## F-statistic: 358.2 on 1 and 1818 DF, p-value: < 2.2e-16
Pitching only model
Combine all pitching variables.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.379 -9.340 0.787 9.917 67.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.476611 2.691696 19.496 < 2e-16 ***
## TEAM_PITCHING_H 0.015148 0.001626 9.319 < 2e-16 ***
## TEAM_PITCHING_HR 0.045438 0.005863 7.750 1.51e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 1817 degrees of freedom
## Multiple R-squared: 0.06883, Adjusted R-squared: 0.06781
## F-statistic: 67.16 on 2 and 1817 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.111 -8.461 0.808 10.497 42.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.86903 3.00594 12.60 <2e-16 ***
## HITS_NOHR 0.03144 0.00218 14.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.92 on 1818 degrees of freedom
## Multiple R-squared: 0.1027, Adjusted R-squared: 0.1022
## F-statistic: 208 on 1 and 1818 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR + TEAM_BATTING_BB + TEAM_FIELDING_E,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.256 -9.176 0.192 9.555 53.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.419903 3.533612 3.232 0.00125 **
## HITS_NOHR 0.047475 0.002322 20.445 < 2e-16 ***
## TEAM_BATTING_BB 0.018140 0.003524 5.148 2.92e-07 ***
## TEAM_FIELDING_E -0.018679 0.002141 -8.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 1816 degrees of freedom
## Multiple R-squared: 0.2172, Adjusted R-squared: 0.2159
## F-statistic: 168 on 3 and 1816 DF, p-value: < 2.2e-16
## Warning in abline(hitsNoHR_bb_e_mod): only using the first two of 4
## regression coefficients
Best in terms of residuals and Rsquared Hits, BB, and Fielding Errors. Plost look good except for short tailed issues in the QQ plot.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_FIELDING_E, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.767 -8.959 0.003 9.071 50.788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.277272 3.639218 1.175 0.240018
## TEAM_BATTING_H 0.050269 0.002305 21.813 < 2e-16 ***
## TEAM_BATTING_BB 0.012509 0.003507 3.567 0.000371 ***
## TEAM_FIELDING_E -0.014193 0.002022 -7.018 3.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.77 on 1816 degrees of freedom
## Multiple R-squared: 0.237, Adjusted R-squared: 0.2357
## F-statistic: 188 on 3 and 1816 DF, p-value: < 2.2e-16
## Warning in abline(hits_bb_e_mod): only using the first two of 4 regression
## coefficients
Attempt at boxcox, didn’t achieve better results in R squared (not SE not directly comparable due to adjustment with boxcox) The QQ plot seems to look a bit better as the negative quantiles are much closer to the line. Box Cox
##
## Call:
## lm(formula = TARGET_WINS_BC ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_FIELDING_E)
##
## Residuals:
## Min 1Q Median 3Q Max
## -194.979 -38.736 -1.184 37.665 223.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.017547 15.331986 -4.175 3.12e-05 ***
## TEAM_BATTING_H 0.211530 0.009709 21.787 < 2e-16 ***
## TEAM_BATTING_BB 0.053146 0.014776 3.597 0.000331 ***
## TEAM_FIELDING_E -0.051346 0.008520 -6.026 2.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.99 on 1816 degrees of freedom
## Multiple R-squared: 0.2317, Adjusted R-squared: 0.2305
## F-statistic: 182.6 on 3 and 1816 DF, p-value: < 2.2e-16